Game Analytics: From Exploratory Data Analysis to Predictive Modeling

Author

Hoang Son Lai

Published

November 17, 2025

1. Data Overview & Processing

Code
# Load and clean the data
game_data <- read.csv("data/game_sessions.csv", stringsAsFactors = FALSE)

# Data cleaning and preprocessing
game_data_clean <- game_data %>%
  mutate(
    start_time = as.POSIXct(start_time, format = "%Y-%m-%dT%H:%M:%OSZ"),
    end_time = as.POSIXct(end_time, format = "%Y-%m-%dT%H:%M:%OSZ"),
    death_reason = as.factor(death_reason),
    # Handle missing end_time
    game_duration = ifelse(is.na(game_duration), 0, game_duration),
    # Create performance metrics
    score_per_second = ifelse(game_duration > 0, score / game_duration, 0),
    accuracy = ifelse(bullets_fired > 0, ufos_shot / bullets_fired, 0)
  ) %>%
  filter(!is.na(start_time))  
Code
# Display basic information
cat("Cleaned Dataset Dimensions:", dim(game_data_clean), "\n")
Cleaned Dataset Dimensions: 300 12 
Code
cat("Date Range:", as.character(min(game_data_clean$start_time)), "to", 
    as.character(max(game_data_clean$start_time)), "\n")
Date Range: 2025-11-16 12:33:32.897 to 2025-11-17 04:58:20.094 
Code
head(game_data_clean)
                   id          start_time            end_time score
1 plane_1763296412897 2025-11-16 12:33:32 2025-11-16 12:33:40     8
2 plane_1763296421212 2025-11-16 12:33:41 2025-11-16 12:33:44     6
3 plane_1763296425226 2025-11-16 12:33:45 2025-11-16 12:33:45     0
4 plane_1763296426741 2025-11-16 12:33:46 2025-11-16 12:33:47     0
5 plane_1763296427702 2025-11-16 12:33:47 2025-11-16 12:33:48     0
6 plane_1763296428948 2025-11-16 12:33:48 2025-11-16 12:33:49     0
  coins_collected ufos_shot bullets_fired death_reason game_duration
1               2         2            33         pipe             7
2               0         2            31         pipe             3
3               0         0             0       ground             0
4               0         0             0       ground             0
5               0         0             0       ground             0
6               0         0             0       ground             0
  pipes_passed score_per_second   accuracy
1            3         1.142857 0.06060606
2            1         2.000000 0.06451613
3            0         0.000000 0.00000000
4            0         0.000000 0.00000000
5            0         0.000000 0.00000000
6            0         0.000000 0.00000000
Code
variable_description <- tibble(
  Variable = c(
    "id",
    "start_time",
    "end_time",
    "score",
    "coins_collected",
    "ufos_shot",
    "bullets_fired",
    "death_reason",
    "game_duration",
    "pipes_passed",
    "score_per_second",
    "accuracy"
  ),
  Description = c(
    "Unique session identifier",
    "Timestamp when the game session started",
    "Timestamp when the game session ended",
    "Final score achieved in the session. Score = coins_collected + (3 × ufos_shot)",
    "Number of coins collected by the player",
    "Number of UFO enemies shot",
    "Total number of bullets fired",
    "Cause of death (collision type / hazard)",
    "Total session duration in seconds",
    "Number of pipes the player successfully passed",
    "Score normalized by session duration (score ÷ game_duration)",
    "Shooting accuracy (ufos_shot ÷ bullets_fired)"
  ),
  Type = c(
    "Character",
    "Datetime",
    "Datetime",
    "Integer",
    "Integer",
    "Integer",
    "Integer",
    "Categorical",
    "Numeric",
    "Integer",
    "Numeric",
    "Numeric"
  )
)

variable_description %>%
  gt() %>%
  tab_header(
    title = md("**Variable Description - Plane Game Analytics**")
  ) %>%
  cols_width(
    Variable ~ px(160),
    Description ~ px(420),
    Type ~ px(120)
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
Table 1
Variable Description - Plane Game Analytics
Variable Description Type
id Unique session identifier Character
start_time Timestamp when the game session started Datetime
end_time Timestamp when the game session ended Datetime
score Final score achieved in the session. Score = coins_collected + (3 × ufos_shot) Integer
coins_collected Number of coins collected by the player Integer
ufos_shot Number of UFO enemies shot Integer
bullets_fired Total number of bullets fired Integer
death_reason Cause of death (collision type / hazard) Categorical
game_duration Total session duration in seconds Numeric
pipes_passed Number of pipes the player successfully passed Integer
score_per_second Score normalized by session duration (score ÷ game_duration) Numeric
accuracy Shooting accuracy (ufos_shot ÷ bullets_fired) Numeric
Code
# Check NA
na_table <- game_data_clean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "NA Count") %>%
  arrange(desc(`NA Count`))

# Display by kable
kable(na_table)
Table 2: Number of NA values per column in game_data_clean
Column NA Count
id 0
start_time 0
end_time 0
score 0
coins_collected 0
ufos_shot 0
bullets_fired 0
death_reason 0
game_duration 0
pipes_passed 0
score_per_second 0
accuracy 0
Code
# Summary statistics
game_data_clean %>%
  select(score, game_duration, coins_collected, ufos_shot, bullets_fired, pipes_passed, accuracy) %>%
  summary() %>%
  kable()
Table 3: Summary statistics for key variables
score game_duration coins_collected ufos_shot bullets_fired pipes_passed accuracy
Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. :0.00000
1st Qu.: 3.00 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 7.75 1st Qu.: 1.00 1st Qu.:0.02941
Median : 7.00 Median : 4.000 Median : 1.000 Median : 2.00 Median : 19.00 Median : 2.00 Median :0.09839
Mean : 12.59 Mean : 9.053 Mean : 3.863 Mean : 2.91 Mean : 23.19 Mean : 4.58 Mean :0.09363
3rd Qu.: 17.25 3rd Qu.:14.000 3rd Qu.: 5.000 3rd Qu.: 4.00 3rd Qu.: 33.00 3rd Qu.: 6.25 3rd Qu.:0.14733
Max. :134.00 Max. :96.000 Max. :56.000 Max. :28.00 Max. :171.00 Max. :57.00 Max. :0.28846

2. Exploratory Data Analysis

2.1. Distribution Overview

Code
# Tidy dataset long & clean variable names
game_long <- game_data_clean %>%
  select(score, game_duration, accuracy, coins_collected, ufos_shot) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable = str_replace_all(variable, "_", " "),  
    variable = str_to_title(variable)                 
  )

# Facet color
facet_colors <- c(
  "Score"            = "#4e79a7",
  "Game Duration"    = "#f28e2b",
  "Accuracy"         = "#59a14f",
  "Coins Collected"  = "#e15759",
  "Ufos Shot"        = "#76b7b2"
)

# Plot
ggplot(game_long, aes(x = value, fill = variable)) +
  geom_histogram(
    bins = 30,
    alpha = 0.7,
    color = "white",
    linewidth = 0.3
  ) +
  scale_fill_manual(values = facet_colors) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  labs(
    title = "Distribution of Game Performance Metrics",
    x = "Value",
    y = "Count",
    fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position    = "none",
    strip.text         = element_text(face = "bold", size = 9),
    plot.title         = element_text(face = "bold", size = 14, hjust = 0.5),
    panel.border       = element_rect(
      color = "white",
      fill  = NA,
      linewidth = 1
    ),
    axis.title.x       = element_text(size = 9, margin = margin(t = 5)),
    axis.title.y       = element_text(size = 9, margin = margin(r = 5)),
    axis.text.x        = element_text(size = 6),
    axis.text.y        = element_text(size = 6)
  )
Figure 1: Distribution of Game Performance Metrics
Code
# Bar chart death reason
death_reason_data <- game_data_clean %>%
  count(death_reason, sort = TRUE) %>%
  mutate(
    death_reason_clean = death_reason %>%
      str_replace_all("_", " ") %>% 
      str_to_title()
  )

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Common Causes Of Death") %>%
  hc_xAxis(
    categories = death_reason_data$death_reason_clean,
    title = list(text = "Death Reason"),
    labels = list(style = list(fontSize = "11px"))
  ) %>%
  hc_yAxis(
    title = list(text = "Number of games"),
    labels = list(style = list(fontSize = "11px"))
  ) %>%
  hc_add_series(
    name = "Count",
    data = death_reason_data$n,
    color = "#1f77b4"
  ) %>%
  hc_tooltip(
    useHTML = TRUE,
    pointFormat = "<b>Count:</b> {point.y}"    
  ) %>%
  hc_plotOptions(
    series = list(
      borderWidth = 0,
      dataLabels = list(enabled = FALSE)
    )
  ) %>%
  hc_legend(enabled = FALSE) %>%              
  hc_exporting(enabled = TRUE)
Figure 2: Common Causes Of Death
Code
# Heatmap correlation
colors <- mako(10)
stops <- seq(0, 1, length.out = length(colors))
custom_colorscale <- lapply(seq_along(colors), function(i) {
  list(stops[i], colors[i])
})

num_vars <- game_data_clean %>%
  select(
    score, game_duration, coins_collected, ufos_shot, 
    bullets_fired, pipes_passed, accuracy
  )

cor_matrix <- cor(num_vars, use = "complete.obs")

plot_ly(
  x = colnames(cor_matrix),
  y = colnames(cor_matrix),
  z = cor_matrix,
  type = "heatmap",
  colorscale = custom_colorscale,
  xgap = 2,
  ygap = 2,
  showscale = TRUE
) %>%
  layout(
    title = list(text = "Correlation Matrix for Key Performance Metrics", x = 0.5),
    xaxis = list(title = "", tickangle = 45, tickfont = list(size = 10)),
    yaxis = list(title = "", autorange = "reversed", tickfont = list(size = 10)),
    plot_bgcolor = "white",
    paper_bgcolor = "white",
    margin = list(l = 80, r = 80, b = 80, t = 80)
  )
Figure 3: Correlation Matrix for Key Performance Metrics

Figure 3

2.2 Death Reason Deep-Dive

Code
# Survival Timeline by Death Reason
format_bin <- function(x) {
  x <- gsub("\\(", "", x)
  x <- gsub("\\]", "", x)
  x <- gsub("\\[", "", x)
  x <- gsub("\\)", "", x)
  x <- gsub(",", "-", x)
  x
}

game_data_binned <- game_data_clean %>%
  mutate(duration_bin = cut(game_duration,
                            breaks = seq(0, 140, by = 5),
                            include.lowest = TRUE)) %>%
  filter(!is.na(duration_bin)) %>%
  mutate(duration_label = format_bin(as.character(duration_bin))) %>%
  count(duration_label, death_reason, name = "count")

duration_levels <- format_bin(as.character(levels(cut(seq(0, 100, by = 5),
                                                      breaks = seq(0, 140, by = 5),
                                                      include.lowest = TRUE))))
game_data_binned$duration_label <- factor(game_data_binned$duration_label,
                                          levels = duration_levels)

timeline_plot <- ggplot(
  game_data_binned,
  aes(
    x = duration_label,
    y = count,
    color = death_reason,
    group = death_reason,
    text = paste0(
      "<b>Death Reason:</b> ", death_reason, "<br>",
      "<b>Duration:</b> ", duration_label, " sec<br>",
      "<b>Count:</b> ", count
    )
  )
) +
  geom_line(size = 0.7) +
  geom_point(size = 1.5) +
  labs(
    title = "Survival Timeline by Death Reason",
    x = "Game Duration (seconds)",
    y = "Number of Deaths",
    color = "Death Reason"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = 5)),
    axis.text.y = element_text(margin = margin(r = 5))
  )

ggplotly(timeline_plot, tooltip = "text") %>%
  layout(
    title = list(
      text = "<b>Survival Timeline by Death Reason</b>",  
      x = 0.5,          
      xanchor = "center",
      font = list(size = 17)  
    ),
    legend = list(
      orientation = "h",
      x = 0.5,
      xanchor = "center",
      y = -0.25,
      yanchor = "top"
    ),
    xaxis = list(
      title_standoff = 20
    ),
    yaxis = list(
      title_standoff = 20
    ),
    margin = list(b = 160)  
  )
Figure 4: Survival Timeline by Death Reason
Code
#  Distribution of score by death_reason
stats <- game_data_clean %>%
  group_by(death_reason) %>%
  summarise(
    count = n(),
    mean  = mean(score),
    min   = min(score),
    q1    = quantile(score, 0.25),
    median= median(score),
    q3    = quantile(score, 0.75),
    max   = max(score)
  )

df <- left_join(game_data_clean, stats, by = "death_reason")

p <- plot_ly()

unique_reasons <- unique(df$death_reason)

for (dr in unique_reasons) {

  dsub <- df %>% filter(death_reason == dr)

  cd <- as.matrix(dsub[, c("count","mean","min","q1","median","q3","max")])

  p <- add_trace(
    p,
    data = dsub,
    x = ~death_reason,
    y = ~score,
    type = "violin",
    name = dr,
    box = list(visible = TRUE),
    meanline = list(visible = TRUE),
    customdata = cd,
    hovertemplate = paste(
      "<b>Death reason:</b> ", dr, "<br>",
      "<b>Score:</b> %{y}<br><br>",
      "<b>Count:</b> %{customdata[0]}<br>",
      "<b>Mean:</b> %{customdata[1]:.2f}<br>",
      "<b>Min:</b> %{customdata[2]}<br>",
      "<b>Q1:</b> %{customdata[3]}<br>",
      "<b>Median:</b> %{customdata[4]}<br>",
      "<b>Q3:</b> %{customdata[5]}<br>",
      "<b>Max:</b> %{customdata[6]}<extra></extra>"
    )
  )
}

 p %>% layout(
  title = "Score Distribution by Death Reason",
  xaxis = list(title = "Death Reason"),
  yaxis = list(title = "Score")
)
Figure 5: Score Distribution by Death Reason
Code
# Expected Value of Score Lost per Death Type
ev_loss <- game_data_clean %>%
    group_by(death_reason) %>%
    rename(`Death reason` = death_reason) %>%
    summarise(
        `Mean score` = mean(score),
        `Median score` = median(score),
        `Count of deaths` = n(),
        .groups = 'drop'
    ) %>%
    arrange(desc(`Mean score`))

ev_loss %>% kable()
Table 4: Expected Value of Score per Death Reason
Death reason Mean score Median score Count of deaths
ufo_collision 30.3333333 30.0 3
enemy_bullet 28.2812500 22.5 32
pipe 14.5054945 8.0 182
ceiling 10.1250000 3.0 8
ground 0.8133333 0.0 75

2.3 Behavioral Feature Engineering

Code
# Behavioral features created 
game_data_enhanced <- game_data_clean %>%
    mutate(
        aggressiveness = ifelse(game_duration > 0, bullets_fired / game_duration, 0),
        efficiency = ifelse(game_duration > 0, score / game_duration, 0),
        risk_taking = ifelse(pipes_passed > 0, ufos_shot / pipes_passed, 0),
        coin_rate = ifelse(game_duration > 0, coins_collected / game_duration, 0),
        ufo_rate = ifelse(game_duration > 0, ufos_shot / game_duration, 0),
        pipe_pass_rate = ifelse(game_duration > 0, pipes_passed / game_duration, 0),
        complexity = bullets_fired + ufos_shot + pipes_passed,
        skill_tier = cut(score, 
                        breaks = c(-1, 4, 9, 14, 29, 49, Inf),
                        labels = c("0-4", "5-9", "10-14", "15-29", "30-49", "50+"))
    )
Code
# Behavioral features
behavioral_vars <- game_data_enhanced %>%
  select(aggressiveness, efficiency, accuracy, risk_taking,
         coin_rate, ufo_rate, pipe_pass_rate, complexity)

# Correlation matrix
behavioral_cor <- cor(behavioral_vars, use = "complete.obs")

# Heatmap Plotly 
plot_ly(
  x = colnames(behavioral_cor),
  y = colnames(behavioral_cor),
  z = behavioral_cor,
  type = "heatmap",
  colorscale = "RdBu",   
  zmin = -1,
  zmax = 1,
  xgap = 2,
  ygap = 2,
  showscale = TRUE
) %>%
  layout(
    title = list(text = "Behavioral Features Correlation Matrix", x = 0.5),
    xaxis = list(title = "", tickangle = 45, tickfont = list(size = 10)),
    yaxis = list(title = "", autorange = "reversed", tickfont = list(size = 10)),
    plot_bgcolor = "white",
    paper_bgcolor = "white",
    margin = list(l = 80, r = 80, b = 80, t = 80)
  )
Figure 6: Behavioral Features Correlation Matrix

2.4. Session Progression Analysis

Code
# (A) Score vs Duration with trendline
score_duration_plot <- ggplot(game_data_enhanced, 
                            aes(x = game_duration, y = score)) +
    geom_point(alpha = 0.6, color = "#1f77b4") +
    geom_smooth(method = "loess", color = "#ff7f0e", se = TRUE) +
    labs(title = "Score vs Game Duration with Trendline",
         x = "Game Duration (seconds)",
         y = "Score") +
    theme_minimal()

ggplotly(score_duration_plot)
Figure 7: Score vs Game Duration with Trendline
Code
# (B) Bullets vs UFO Shot 
efficiency_plot <- ggplot(game_data_enhanced,
                         aes(x = bullets_fired, y = ufos_shot, color = skill_tier)) +
    geom_point(alpha = 0.7) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Bullets Fired vs UFOs Shot",
         x = "Bullets Fired", 
         y = "UFOs Shot",
         color = "Skill Tier (by Score)") +
    theme_minimal()

ggplotly(efficiency_plot)
Figure 8: Bullets Fired vs UFOs Shot

2.5 Clusterable Structure Check

Code
# PCA Analysis
cluster_data <- game_data_enhanced %>%
    select(score, game_duration, coins_collected, bullets_fired,
           ufos_shot, pipes_passed, aggressiveness, efficiency,
           accuracy, risk_taking) %>%
    scale()

pca_result <- prcomp(cluster_data, scale. = TRUE)

pca_df <- as.data.frame(pca_result$x)
pca_df$death_reason <- game_data_enhanced$death_reason
pca_df$score <- game_data_enhanced$score
pca_df$game_duration <- game_data_enhanced$game_duration
pca_df$accuracy <- round(game_data_enhanced$accuracy, 3)
pca_df$coins_collected <- game_data_enhanced$coins_collected
pca_df$ufos_shot <- game_data_enhanced$ufos_shot
pca_df$pipes_passed <- game_data_enhanced$pipes_passed

pca_df$tooltip <- paste(
    "Death reason:", pca_df$death_reason, "<br>",
    "Score:", pca_df$score, "<br>",
    "Duration:", pca_df$game_duration, "s<br>",
    "Accuracy:", pca_df$accuracy, "<br>",
    "Coins:", pca_df$coins_collected, "<br>",
    "UFOs:", pca_df$ufos_shot, "<br>",
    "Pipes:", pca_df$pipes_passed
)

pca_plot <- plot_ly(pca_df, x = ~PC1, y = ~PC2, 
                   color = ~death_reason,
                   type = 'scatter', mode = 'markers',
                   text = ~tooltip,
                   hoverinfo = 'text',
                   marker = list(size = 8, opacity = 0.7)) %>%
    layout(title = "PCA - Player Behavior Patterns by Death Reason",
           xaxis = list(title = paste0("PC1 (", 
                                     round(100 * pca_result$sdev[1]^2/sum(pca_result$sdev^2), 1), "%)")),
           yaxis = list(title = paste0("PC2 (", 
                                     round(100 * pca_result$sdev[2]^2/sum(pca_result$sdev^2), 1), "%)")))

pca_plot
Figure 9: PCA - Player Behavior Patterns by Death Reason
Code
# PCA Loadings - Meaning of Principal components
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2])
pca_loadings$feature <- rownames(pca_loadings)

loadings_plot <- ggplot(pca_loadings, aes(x = PC1, y = PC2, label = feature)) +
    geom_point(size = 3, color = "blue") +
    geom_text_repel(size = 4, max.overlaps = 10) +
    geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
    geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
    labs(title = "PCA Loadings - Meaning of Principal Components",
         x = paste0("PC1 (", round(100 * pca_result$sdev[1]^2/sum(pca_result$sdev^2), 1), "%)"),
         y = paste0("PC2 (", round(100 * pca_result$sdev[2]^2/sum(pca_result$sdev^2), 1), "%)")) +
    theme_minimal()

loadings_plot
Figure 10: PCA Loadings - Meaning of Principal Components

3. Bootstrapping Data for Machine Learning

Code
# Sort by time and split data
game_sorted <- game_data_enhanced %>% arrange(start_time)
train_base <- head(game_sorted, 250)
test_holdout <- tail(game_sorted, 50)

# Bootstrap training data
set.seed(123)
bootstrap_size <- 10000
train_bootstrapped <- train_base %>% 
    slice_sample(n = bootstrap_size, replace = TRUE) %>% 
    mutate(is_synthetic = TRUE)

cat("Original Train Size:", nrow(train_base), "\n")
Original Train Size: 250 
Code
cat("Bootstrapped Train Size:", nrow(train_bootstrapped), "\n")
Bootstrapped Train Size: 10000 
Code
cat("Holdout Test Size:", nrow(test_holdout), "\n")
Holdout Test Size: 50 
Code
# Compare distribution real vs bootstrapped
compare_plot <- ggplot() +
    geom_density(data = train_base, aes(x = score, color = "Real"), size = 1) +
    geom_density(data = train_bootstrapped, aes(x = score, color = "Bootstrapped"), size = 1, alpha = 0.7) +
    labs(title = "Score Distribution: Real vs Bootstrapped Data",
         x = "Score", y = "Density", color = "Data Type") +
    theme_minimal()

compare_plot
Figure 11: Score Distribution: Real vs Bootstrapped Data

4. Segmentation (Unsupervised Learning)

Code
# Features for clustering
cluster_features_enhanced <- train_bootstrapped %>% 
    select(score, game_duration, coins_collected, bullets_fired,
           ufos_shot, pipes_passed, aggressiveness, efficiency,
           accuracy, risk_taking)

scaled_features_enhanced <- scale(cluster_features_enhanced)

# KMeans clustering with 3 clusters
set.seed(123)
kmeans_enhanced <- kmeans(scaled_features_enhanced, centers = 3, nstart = 25)
train_bootstrapped$cluster_enhanced <- as.factor(kmeans_enhanced$cluster)

# Visualize clusters
fviz_cluster(kmeans_enhanced, data = scaled_features_enhanced,
            geom = "point", ellipse.type = "convex",
            ggtheme = theme_minimal(),
            main = "Enhanced Player Segmentation (K-Means)")
Figure 12: Enhanced Player Segmentation (K-Means)
Code
# Cluster profiles
cluster_profiles <- train_bootstrapped %>%
    group_by(cluster_enhanced) %>%
    rename(`Cluster enhanced` = cluster_enhanced) %>%
    summarise(
        Count = n(),
        `Avg Score` = mean(score),
        `Avg Duration` = mean(game_duration),
        `Avg Coins` = mean(coins_collected),
        `Avg UFOs` = mean(ufos_shot),
        `Avg Bullets` = mean(bullets_fired),
        `Avg Aggressiveness` = mean(aggressiveness),
        `Avg Efficiency` = mean(efficiency),
        `Avg Accuracy` = mean(accuracy),
        .groups = 'drop'
    )

cluster_profiles %>% kable()
Table 5: Cluster profiles
Cluster enhanced Count Avg Score Avg Duration Avg Coins Avg UFOs Avg Bullets Avg Aggressiveness Avg Efficiency Avg Accuracy
1 956 52.4592050 37.00628 19.1140167 11.1150628 70.791841 2.0757690 1.5083939 0.1591852
2 3061 0.4168572 1.40477 0.2198628 0.0656648 1.815093 0.3103436 0.0581862 0.0039203
3 5983 10.7751964 7.54588 2.6095604 2.7218787 23.927628 5.4787370 1.8245718 0.1212711

5. Predictive Modeling (Supervised Learning)

5.1 Score Forecasting (Regression)

We use a Random Forest model to predict the final score based on gameplay metrics.

Code
library(randomForest)
library(caret)

# Define features
features <- c("coins_collected", "ufos_shot", "bullets_fired", "game_duration", "pipes_passed")

# Train Random Forest on Bootstrapped Data
rf_model_score <- randomForest(
  as.formula(paste("score ~", paste(features, collapse = "+"))),
  data = train_bootstrapped,
  ntree = 100,
  importance = TRUE
)

# Predict on Holdout Test Set
predictions_rf <- predict(rf_model_score, newdata = test_holdout)

# Evaluate Performance (RMSE & R-squared)
rmse_val <- RMSE(predictions_rf, test_holdout$score)
r2_val <- R2(predictions_rf, test_holdout$score)

cat("Random Forest Performance on Holdout Set:\n")
Random Forest Performance on Holdout Set:
Code
cat("RMSE:", round(rmse_val, 2), "\n")
RMSE: 1.83 
Code
cat("R-Squared:", round(r2_val, 4), "\n")
R-Squared: 0.9807 
Code
# Variable Importance Plot
varImpPlot(rf_model_score, main = "Feature Importance for Score Prediction")

5.2 Survival Analysis (Logistic Regression)

We predict whether a player will survive past a specific “expert” threshold (e.g., 30 seconds). This is a binary classification problem.

Code
# Define 'Survival' as lasting longer than 30 seconds
threshold <- 30

train_bootstrapped <- train_bootstrapped %>%
  mutate(survived_expert = as.factor(ifelse(game_duration > threshold, 1, 0)))

test_holdout <- test_holdout %>%
  mutate(survived_expert = as.factor(ifelse(game_duration > threshold, 1, 0)))

# Train Logistic Regression
# We remove variables that directly calculate duration/score to prevent data leakage, focusing on behavioral counts
log_model <- glm(survived_expert ~ bullets_fired + ufos_shot + coins_collected, 
                 data = train_bootstrapped, 
                 family = "binomial")

# Predict probabilities on Test Set
probs_survival <- predict(log_model, newdata = test_holdout, type = "response")
preds_survival <- ifelse(probs_survival > 0.5, 1, 0)

# Confusion Matrix
conf_matrix <- confusionMatrix(as.factor(preds_survival), test_holdout$survived_expert)

cat("Logistic Regression Accuracy:", round(conf_matrix$overall['Accuracy'], 4), "\n")
Logistic Regression Accuracy: 0.96 
Code
print(conf_matrix$table)
          Reference
Prediction  0  1
         0 47  1
         1  1  1

6. Business Insights & Recommendations

Based on the analysis above, we derive the following actionable insights:

6.1. Difficulty Balancing:

Observation: The death_reason analysis highlights the most common obstacles (e.g., pipes vs. enemies). If ‘pipe’ collisions are disproportionately high early in the game, the initial difficulty curve may be too steep.

Recommendation: Adjust the gap size or spawn rate of the leading cause of death in the first 10 seconds of gameplay to improve retention.

6.2. Player Segmentation Strategy:

Observation: K-Means clustering identified distinct groups. (Refer to cluster table: e.g., High-duration/low-coin collectors vs. Aggressive shooters).

Recommendation: Introduce targeted rewards.

  • For ‘Survivors’ (High duration, low action): Introduce time-based achievements.

  • For ‘Shooters’ (High bullets/UFOs): Offer weapon skins or visual upgrades for combat milestones.

6.3. Predictive Engagement:

Observation: The Random Forest model shows that specific actions (like coins_collected or ufos_shot) are strong predictors of high scores.

Recommendation: Create a tutorial or “Daily Mission” focusing on these high-value actions to teach new players how to achieve higher scores effectively.

6.4. Monetization Opportunities:

Observation: Players who survive past the 30-second threshold (analyzed in the Logistic Regression) show higher engagement.

Recommendation: Trigger “Continue?” ads or special offers only after a player has demonstrated this “expert” survival trait, as they are more invested in the session than a player who dies instantly.